Source: Kaggle - Heart Failure Prediction Dataset
Description: This dataset contains 12 variables: 11 predictors (age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), and slope of the peak exercise ST segment) and the target variable, the presence or absence of heart disease. With 918 observations and both quantitative and categorical variables, it meets the project guideline's requirements of at least 10 variables and 100 observational rows.
Dataset Name: hearts
Website to Download the Data: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction (Kaggle - Heart Failure Prediction Dataset)
Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.
People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.
The project aims to identify key predictors of heart disease and provide a comprehensive analysis that can aid in early detection and better management of cardiovascular diseases. By leveraging statistical and machine learning techniques, the findings will contribute to the understanding of heart disease risk factors and support the development of preventive measures.
Data Import: To load the dataset into R for analysis.
Check for Missing Values: To ensure data completeness and integrity.
Data Conversion: To ensure categorical variables are treated appropriately in the analysis.
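The three preparation steps above can be sketched in base R as follows. This is a minimal illustration, not the report's actual code: the file name `heart.csv` and the helper `convert_categoricals()` are assumptions, and the small data frame simply copies the first four rows printed in this report so that the example runs on its own.

```r
# First four rows of the dataset, copied from the printed output in this report.
hearts_sample <- data.frame(
  Age            = c(40, 49, 37, 48),
  Sex            = c("M", "F", "M", "F"),
  ChestPainType  = c("ATA", "NAP", "ATA", "ASY"),
  RestingBP      = c(140, 160, 130, 138),
  Cholesterol    = c(289, 180, 283, 214),
  FastingBS      = c(0, 0, 0, 0),
  RestingECG     = c("Normal", "Normal", "ST", "Normal"),
  MaxHR          = c(172, 156, 98, 108),
  ExerciseAngina = c("N", "N", "N", "Y"),
  Oldpeak        = c(0, 1, 0, 1.5),
  ST_Slope       = c("Up", "Flat", "Up", "Flat"),
  HeartDisease   = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# With the full file, the import step would look something like this
# (the file name is an assumption):
# hearts <- readr::read_csv("heart.csv")

# Check for missing values, per column and in total.
missing_per_column <- colSums(is.na(hearts_sample))
total_missing <- sum(missing_per_column)

# Hypothetical helper: convert every character column to a factor so that
# modelling functions treat it as categorical.
convert_categoricals <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], factor)
  df
}

hearts_clean <- convert_categoricals(hearts_sample)
str(hearts_clean)
```

Note that `is.na()` only detects values stored as `NA`; placeholder values such as a recorded 0 would need a separate check.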
## # A tibble: 6 × 12
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## # ℹ 4 more variables: ExerciseAngina <chr>, Oldpeak <dbl>, ST_Slope <chr>,
## # HeartDisease <dbl>
## Age Sex ChestPainType RestingBP
## Min. :28.00 Length:918 Length:918 Min. : 0.0
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0
## Median :54.00 Mode :character Mode :character Median :130.0
## Mean :53.51 Mean :132.4
## 3rd Qu.:60.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. : 0.0 Min. :0.0000 Length:918 Min. : 60.0
## 1st Qu.:173.2 1st Qu.:0.0000 Class :character 1st Qu.:120.0
## Median :223.0 Median :0.0000 Mode :character Median :138.0
## Mean :198.8 Mean :0.2331 Mean :136.8
## 3rd Qu.:267.0 3rd Qu.:0.0000 3rd Qu.:156.0
## Max. :603.0 Max. :1.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## Length:918 Min. :-2.6000 Length:918 Min. :0.0000
## Class :character 1st Qu.: 0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median : 0.6000 Mode :character Median :1.0000
## Mean : 0.8874 Mean :0.5534
## 3rd Qu.: 1.5000 3rd Qu.:1.0000
## Max. : 6.2000 Max. :1.0000
Data Tibble.
## [1] TRUE
Tidy Data
## [1] 0
To characterize the dataset thoroughly, we use several R functions that report its structure, summary statistics, distributions, and the relationships between variables.
## spc_tbl_ [918 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : chr [1:918] "M" "F" "M" "F" ...
## $ ChestPainType : chr [1:918] "ATA" "NAP" "ATA" "ASY" ...
## $ RestingBP : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : num [1:918] 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : chr [1:918] "Normal" "Normal" "ST" "Normal" ...
## $ MaxHR : num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: chr [1:918] "N" "N" "N" "Y" ...
## $ Oldpeak : num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : chr [1:918] "Up" "Flat" "Up" "Flat" ...
## $ HeartDisease : num [1:918] 0 1 0 1 0 0 0 0 1 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Sex = col_character(),
## .. ChestPainType = col_character(),
## .. RestingBP = col_double(),
## .. Cholesterol = col_double(),
## .. FastingBS = col_double(),
## .. RestingECG = col_character(),
## .. MaxHR = col_double(),
## .. ExerciseAngina = col_character(),
## .. Oldpeak = col_double(),
## .. ST_Slope = col_character(),
## .. HeartDisease = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## # A tibble: 918 × 12
## SexF SexM ChestPainTypeATA ChestPainTypeNAP ChestPainTypeTA FastingBS1
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 1 0 0 0
## 2 1 0 0 1 0 0
## 3 0 1 1 0 0 0
## 4 1 0 0 0 0 0
## 5 0 1 0 1 0 0
## 6 0 1 0 1 0 0
## 7 1 0 1 0 0 0
## 8 0 1 1 0 0 0
## 9 0 1 0 0 0 0
## 10 1 0 1 0 0 0
## # ℹ 908 more rows
## # ℹ 6 more variables: RestingECGNormal <dbl>, RestingECGST <dbl>,
## # ExerciseAnginaY <dbl>, ST_SlopeFlat <dbl>, ST_SlopeUp <dbl>,
## # HeartDisease1 <dbl>
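The indicator columns printed above (SexF, SexM, ChestPainTypeATA, and so on) have the shape that base R's `model.matrix()` produces for factor variables when the intercept is removed. The sketch below is an assumption about how such a table could be built, not a record of the code used in this report; it uses only the Sex and ChestPainType values of the first four printed rows.

```r
# Two categorical columns from the first four rows shown in this report.
sample_cat <- data.frame(
  Sex           = factor(c("M", "F", "M", "F")),
  ChestPainType = factor(c("ATA", "NAP", "ATA", "ASY"))
)

# "- 1" removes the intercept, so the first factor keeps all of its levels
# (SexF and SexM), while later factors drop their reference level (ASY).
indicators <- model.matrix(~ Sex + ChestPainType - 1, data = sample_cat)
colnames(indicators)
indicators
```

In this sketch the first row receives SexM = 1 and ChestPainTypeATA = 1, and the fourth row (a female patient with ASY chest pain) receives SexF = 1 and zeros for the chest pain indicators, which agrees with the first and fourth rows of the table above.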
To generate insightful plots using ggplot2, we will create visualizations that highlight relationships and trends between categorical and quantitative variables in the heart disease dataset. These plots will include histograms, boxplots, and scatter plots with trend lines, which can help us uncover patterns and insights in the data.
Visualization Goals
Description:
Distribution of Age by Heart Disease Status: Use histograms to show how age is distributed among those with and without heart disease.
Boxplots of Cholesterol by Chest Pain Type: Compare cholesterol levels across different types of chest pain.
Scatter Plot of Max Heart Rate vs. Age: Examine the relationship between age and maximum heart rate, differentiated by heart disease status.
Bar Plot of Heart Disease by Sex: Show the proportion of heart disease for each sex.
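The four plots listed above could be built with ggplot2 along the following lines. This is a hedged sketch rather than the report's own plotting code: the function names, the bin width of 5 years, and the transparency settings are assumptions, while the fill colors follow the plot descriptions in this report. Each function expects a data frame with the dataset's column names.

```r
library(ggplot2)

# Histogram of age, one overlaid distribution per heart disease status.
plot_age_by_status <- function(df, binwidth = 5) {
  ggplot(df, aes(x = Age, fill = factor(HeartDisease))) +
    geom_histogram(binwidth = binwidth, position = "identity", alpha = 0.6) +
    scale_fill_manual(values = c("0" = "blue", "1" = "purple"),
                      name = "HeartDisease") +
    labs(title = "Age Distribution by Heart Disease Status",
         x = "Age", y = "Count")
}

# Boxplot of cholesterol for each chest pain type.
plot_chol_by_chest_pain <- function(df) {
  ggplot(df, aes(x = ChestPainType, y = Cholesterol)) +
    geom_boxplot() +
    labs(title = "Cholesterol Levels by Chest Pain Type",
         x = "Chest Pain Type", y = "Cholesterol")
}

# Scatter plot of maximum heart rate against age with linear trend lines.
plot_maxhr_vs_age <- function(df) {
  ggplot(df, aes(x = Age, y = MaxHR, colour = factor(HeartDisease))) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    scale_colour_manual(values = c("0" = "red", "1" = "purple"),
                        name = "HeartDisease") +
    labs(title = "Max Heart Rate vs. Age by Heart Disease Status",
         x = "Age", y = "Max Heart Rate")
}

# Stacked bar plot scaled to proportions within each sex.
plot_disease_by_sex <- function(df) {
  ggplot(df, aes(x = Sex, fill = factor(HeartDisease))) +
    geom_bar(position = "fill") +
    scale_fill_manual(values = c("0" = "pink", "1" = "purple"),
                      name = "HeartDisease") +
    labs(title = "Proportion of Heart Disease by Sex",
         x = "Sex", y = "Proportion")
}

# Usage, assuming the full dataset is loaded as `hearts`:
# plot_age_by_status(hearts)
# plot_disease_by_sex(hearts)
```

Using `position = "fill"` in the bar plot rescales each bar to 1, so the bars show proportions within each sex rather than raw counts.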
Age Distribution by Heart Disease Status
Description: This histogram shows the distribution of ages for individuals with and without heart disease. The blue bars represent individuals without heart disease (HeartDisease = 0), and the purple bars represent individuals with heart disease (HeartDisease = 1).
Interpretation:
Individuals with heart disease are generally older, with a noticeable increase in heart disease prevalence starting from the age of 40. The peak prevalence for individuals with heart disease is around 50 to 60 years. There are fewer young individuals (under 40) with heart disease, suggesting that age is associated with heart disease prevalence.
Cholesterol Levels by Chest Pain Type
Description: This box plot shows the distribution of cholesterol levels for different chest pain types (ASY: Asymptomatic, ATA: Atypical Angina, NAP: Non-Anginal Pain, TA: Typical Angina).
Interpretation:
Individuals with asymptomatic (ASY) chest pain have a wider range of cholesterol levels, with many outliers indicating higher cholesterol. Those with typical angina (TA) and non-anginal pain (NAP) tend to have higher cholesterol levels compared to other chest pain types. Atypical angina (ATA) shows a more consistent range of cholesterol levels with fewer extreme values.
Max Heart Rate vs. Age by Heart Disease Status
Description: This scatter plot shows the relationship between maximum heart rate and age, with different colors representing heart disease status (Red: No heart disease, Purple: Heart disease).
Interpretation:
There is a general decline in maximum heart rate with increasing age for both groups. Individuals without heart disease (red) tend to have higher maximum heart rates compared to those with heart disease (purple). This negative correlation suggests that as individuals age, their maximum heart rate decreases, and those with lower maximum heart rates are more likely to have heart disease.
Proportion of Heart Disease by Sex
Description: This bar plot shows the proportion of heart disease prevalence by sex. The pink bars represent individuals without heart disease (HeartDisease = 0), and the purple bars represent individuals with heart disease (HeartDisease = 1).
Interpretation:
Males (M) have a higher proportion of heart disease compared to females (F). The majority of females do not have heart disease, while a significant proportion of males do. This indicates that sex is an important factor, with males being more susceptible to heart disease.
Overall Interpretation
These visualizations provide important insights into the factors associated with heart disease:
Age: Older age groups have higher heart disease prevalence.
Cholesterol: Higher cholesterol levels are more common in individuals with certain types of chest pain.
Max Heart Rate: Lower maximum heart rates are associated with higher heart disease prevalence, especially in older individuals.
Sex: Males are at a higher risk of developing heart disease compared to females.
To visualize the correlation matrix, we use a heatmap that highlights both the strength and direction of the relationships between variables.
This correlation matrix visualizes the relationships between different variables in the dataset. The color scale on the right indicates the strength and direction of the correlations, with values ranging from -1 to 1. Positive correlations are shown in shades of blue, while negative correlations are in shades of red.
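A correlation matrix like the one described above can be computed with base R's `cor()`. The sketch below is illustrative only: the helper name `numeric_correlations()` is hypothetical, and the demo data are the numeric values of the first six rows printed in this report, so the resulting numbers will not match the full-data correlations discussed here.

```r
# Hypothetical helper: Pearson correlations between all numeric columns.
numeric_correlations <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  round(cor(numeric_cols, use = "pairwise.complete.obs"), 2)
}

# Numeric columns of the first six rows shown in this report.
demo <- data.frame(
  Age          = c(40, 49, 37, 48, 54, 39),
  RestingBP    = c(140, 160, 130, 138, 150, 120),
  Cholesterol  = c(289, 180, 283, 214, 195, 339),
  MaxHR        = c(172, 156, 98, 108, 122, 170),
  Oldpeak      = c(0, 1, 0, 1.5, 0, 0),
  HeartDisease = c(0, 1, 0, 1, 0, 0)
)

cor_demo <- numeric_correlations(demo)

# Long format (one row per variable pair), convenient for heatmap plotting.
cor_long <- as.data.frame(as.table(cor_demo))
names(cor_long) <- c("Var1", "Var2", "Correlation")
head(cor_long)
```

The long-format table can be passed to any heatmap routine, for example a tile plot with the correlation mapped to fill color.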
Key Observations:
Cholesterol:
Shows a weak positive correlation with MaxHR (0.24). Has a weak negative correlation with HeartDisease (-0.23); taken at face value, this would suggest that higher cholesterol is associated with a slightly lower likelihood of heart disease, a counterintuitive result that should be interpreted with caution.
MaxHR:
Shows a moderate negative correlation with HeartDisease (-0.40), indicating that higher maximum heart rates are associated with a lower likelihood of heart disease. Has a weak-to-moderate negative correlation with Age (-0.38), meaning that older individuals tend to have lower maximum heart rates.
FastingBS:
Shows a weak positive correlation with HeartDisease (0.27), suggesting that elevated fasting blood sugar is associated with a slightly higher likelihood of heart disease. Has a weak positive correlation with Age (0.20) and a negligible one with RestingBP (0.07).
RestingBP:
Shows a weak positive correlation with HeartDisease (0.11), indicating a minor increase in heart disease likelihood with higher resting blood pressure. Has weak positive correlations with Age (0.25) and Oldpeak (0.16).
Age:
Shows a weak positive correlation with HeartDisease (0.28), indicating that older age is associated with a higher likelihood of heart disease. Has weak positive correlations with RestingBP (0.25), Oldpeak (0.26), and FastingBS (0.20).
Oldpeak:
Shows the strongest positive correlation with HeartDisease (0.40) among the variables, suggesting that higher ST depression is moderately associated with a higher likelihood of heart disease. Has weak positive correlations with Age (0.26) and RestingBP (0.16).
Summary:
MaxHR and Oldpeak are the variables most strongly correlated with heart disease in this dataset, with MaxHR showing a moderate negative correlation (-0.40) and Oldpeak a moderate positive correlation (0.40). Age and FastingBS show weak positive correlations with heart disease, indicating that older age and higher fasting blood sugar are associated with a slightly higher likelihood of heart disease. Cholesterol (-0.23) and RestingBP (0.11) show weaker correlations, suggesting they are less informative on their own. Because these are pairwise correlations, they describe association only and do not account for the other variables.
##
## Call:
## lm(formula = HeartDisease ~ Age + Sex + RestingBP + Cholesterol +
## MaxHR + ExerciseAngina + Oldpeak + ST_Slope, data = heart_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0947 -0.1770 0.0085 0.1820 1.0592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4287953 0.1479534 2.898 0.00384 **
## Age 0.0032174 0.0013784 2.334 0.01980 *
## SexM 0.1855887 0.0292847 6.337 3.68e-10 ***
## RestingBP 0.0002044 0.0006501 0.314 0.75329
## Cholesterol -0.0006753 0.0001110 -6.082 1.75e-09 ***
## MaxHR -0.0013184 0.0005426 -2.430 0.01530 *
## ExerciseAnginaY 0.1916519 0.0281923 6.798 1.92e-11 ***
## Oldpeak 0.0592776 0.0131440 4.510 7.34e-06 ***
## ST_SlopeFlat 0.1545264 0.0483928 3.193 0.00146 **
## ST_SlopeUp -0.2651394 0.0537090 -4.937 9.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3458 on 908 degrees of freedom
## Multiple R-squared: 0.5213, Adjusted R-squared: 0.5166
## F-statistic: 109.9 on 9 and 908 DF, p-value: < 2.2e-16
The model explains approximately 51.66% of the variability in HeartDisease (adjusted R-squared = 0.5166). Significant predictors (p-values < 0.05) include Age, SexM, Cholesterol, MaxHR, ExerciseAnginaY, Oldpeak, ST_SlopeFlat, and ST_SlopeUp. RestingBP is not a significant predictor in this model. The overall model is significant (p-value < 2.2e-16). These results help identify which factors are most strongly associated with heart disease according to the model.
Residuals
The residuals are the differences between the observed and predicted values of the dependent variable (HeartDisease). Their summary statistics describe the distribution:
Min: -1.0947
1Q (First Quartile): -0.1770
Median: 0.0085
3Q (Third Quartile): 0.1820
Max: 1.0592
The median residual is close to zero, suggesting that the model does not have a large systematic bias, and half of the residuals lie between -0.1770 and 0.1820. The extremes (-1.0947 and 1.0592) exceed 1 in absolute value; because HeartDisease is coded 0 or 1, this means some fitted values fall outside the 0 to 1 range, a known limitation of fitting a linear model to a binary outcome.
Coefficients
The coefficients table shows the relationship between each predictor and the outcome variable, controlling for the other predictors.
Intercept (0.4287953): The expected value of HeartDisease when all predictors are zero. However, this might not be meaningful if zero is not a realistic value for the predictors.
Age (0.0032174, p = 0.01980): Because HeartDisease is coded 0/1, the fitted value can be read as a predicted probability. Each additional year of age raises it by about 0.0032 (roughly 0.32 percentage points), holding other variables constant. This is statistically significant (p < 0.05), indicating age is a meaningful predictor.
SexM (0.1855887, p = 3.68e-10): Being male increases the expected value of HeartDisease by 0.1856 units compared to females, holding other variables constant. This is highly significant (p < 0.001).
RestingBP (0.0002044, p = 0.75329): Resting blood pressure does not have a statistically significant effect on HeartDisease in this model (p > 0.1).
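Cholesterol (-0.0006753, p = 1.75e-09): Each unit increase in cholesterol decreases the expected value of HeartDisease by 0.0006753 units, which is highly significant (p < 0.001). This negative relationship is counterintuitive and needs further investigation. One possible explanation is visible in the summary statistics: the minimum recorded Cholesterol is 0, which is not physiologically plausible and may indicate missing measurements stored as zero.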
MaxHR (-0.0013184, p = 0.01530): Each unit increase in maximum heart rate achieved decreases the expected value of HeartDisease by 0.0013184 units, and this is statistically significant (p < 0.05).
ExerciseAnginaY (0.1916519, p = 1.92e-11): Presence of exercise-induced angina increases the expected value of HeartDisease by 0.1916519 units, highly significant (p < 0.001).
Oldpeak (0.0592776, p = 7.34e-06): Each unit increase in ST depression induced by exercise relative to rest increases HeartDisease by 0.0592776 units, highly significant (p < 0.001).
ST_SlopeFlat (0.1545264, p = 0.00146): Having a flat slope of the peak exercise ST segment increases HeartDisease by 0.1545264 units, significant (p < 0.01).
ST_SlopeUp (-0.2651394, p = 9.46e-07): Having an upsloping ST segment decreases HeartDisease by 0.2651394 units, highly significant (p < 0.001).
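To make the coefficient interpretation concrete, the fitted value for a single patient can be computed by hand from the coefficient table. The sketch below uses the estimates exactly as printed in the model summary and the first patient in the printed data (a 40-year-old male with RestingBP 140, Cholesterol 289, MaxHR 172, no exercise angina, Oldpeak 0, and an upsloping ST segment). The variable names `coefs` and `patient` are illustrative.

```r
# Coefficient estimates copied from the model summary in this report.
coefs <- c(
  "(Intercept)"   =  0.4287953,
  Age             =  0.0032174,
  SexM            =  0.1855887,
  RestingBP       =  0.0002044,
  Cholesterol     = -0.0006753,
  MaxHR           = -0.0013184,
  ExerciseAnginaY =  0.1916519,
  Oldpeak         =  0.0592776,
  ST_SlopeFlat    =  0.1545264,
  ST_SlopeUp      = -0.2651394
)

# First patient in the printed data, written as a row of the design matrix:
# indicator variables are 1 when the category applies and 0 otherwise.
patient <- c(
  "(Intercept)"   = 1,
  Age             = 40,
  SexM            = 1,
  RestingBP       = 140,
  Cholesterol     = 289,
  MaxHR           = 172,
  ExerciseAnginaY = 0,
  Oldpeak         = 0,
  ST_SlopeFlat    = 0,
  ST_SlopeUp      = 1
)

# The fitted value is the sum of coefficient times predictor value.
fitted_value <- sum(coefs * patient[names(coefs)])
round(fitted_value, 4)  # 0.0846
```

The fitted value of about 0.085 is low, which agrees with this patient's observed HeartDisease value of 0.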
Summary of Regression Model Interpretation
Model Explanation:
The model explains approximately 51.66% of the variability in heart disease outcomes, as indicated by the Adjusted R-squared value.
Significant Predictors:
The following predictors are statistically significant (p-values < 0.05), indicating a meaningful relationship with heart disease:
Age: Older age is associated with an increased likelihood of heart disease.
Sex (Male): Males have a higher likelihood of heart disease compared to females.
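Cholesterol: Higher cholesterol levels are associated with a slightly lower predicted likelihood of heart disease in this model (negative coefficient), a counterintuitive result that warrants further investigation.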
MaxHR (Maximum Heart Rate Achieved): Lower maximum heart rate achieved is associated with an increased likelihood of heart disease.
Exercise-Induced Angina (ExerciseAnginaY): Presence of exercise-induced angina is associated with a higher likelihood of heart disease.
Oldpeak: Higher ST depression induced by exercise relative to rest is associated with a higher likelihood of heart disease.
ST Slope (Flat and Up): Flat ST slope is associated with a higher likelihood, while upsloping ST segment is associated with a lower likelihood of heart disease.
Non-Significant Predictor:
Resting Blood Pressure (RestingBP): This predictor does not have a statistically significant effect on heart disease in this model.
Overall Model Significance:
The model as a whole is highly significant (F = 109.9 on 9 and 908 degrees of freedom, p-value < 2.2e-16), indicating that the predictors, taken together, explain significantly more variation in heart disease than an intercept-only model.
Conclusion:
This model provides valuable insights into the factors most strongly associated with heart disease, which can be critical for understanding risk and informing preventative measures. The significant predictors highlight key areas for monitoring and intervention in clinical practice.
##
## studentized Breusch-Pagan test
##
## data: model_heart
## BP = 33.891, df = 9, p-value = 9.335e-05
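The studentized Breusch-Pagan statistic reported above can be reproduced by hand: regress the squared residuals on the model's predictors and multiply the R-squared of that auxiliary regression by the number of observations. The sketch below is a base R illustration under stated assumptions: `studentized_bp()` is a hypothetical helper, and the data are synthetic (simulated with a fixed seed), not the heart dataset. On the same model it should agree with `lmtest::bptest()` at its default settings, if that package is available.

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan test for an lm fit.
studentized_bp <- function(model) {
  X  <- model.matrix(model)       # design matrix, including the intercept
  u2 <- residuals(model)^2        # squared residuals
  aux <- lm.fit(X, u2)            # auxiliary regression of u^2 on the predictors
  r2 <- 1 - sum(aux$residuals^2) / sum((u2 - mean(u2))^2)
  stat <- length(u2) * r2         # test statistic: n times auxiliary R-squared
  df <- ncol(X) - 1               # one degree of freedom per non-intercept term
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Synthetic example: a binary outcome fitted with a linear model.
set.seed(1)
x <- runif(200)
y <- rbinom(200, 1, prob = 0.1 + 0.8 * x)
toy_model <- lm(y ~ x)

bp_result <- studentized_bp(toy_model)
bp_result
```

In the report's output, df = 9 matches the nine non-intercept terms of the model, and the small p-value indicates that the residual variance is not constant across observations.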
The regression model explains 51.66% of heart disease variability (adjusted R-squared = 0.5166) and highlights the significant predictors: Age, SexM, Cholesterol, MaxHR, ExerciseAnginaY, Oldpeak, ST_SlopeFlat, and ST_SlopeUp. Older age and being male increase the predicted likelihood of heart disease, while higher cholesterol slightly decreases it. Lower MaxHR and the presence of exercise-induced angina raise heart disease likelihood. Higher Oldpeak (ST depression) and a flat ST slope increase risk, while an upsloping ST segment decreases it. RestingBP is not significant. The overall model is highly significant (p-value < 2.2e-16). The studentized Breusch-Pagan test (BP = 33.891, df = 9, p-value = 9.335e-05) rejects constant residual variance, so the reported standard errors and p-values should be interpreted with some caution.